A Statistical Model for Hangeul-Hanja Conversion in Terminology Domain

Authors

  • Jin-Xia Huang
  • Sun-Mee Bae
  • Key-Sun Choi
Abstract

Sino-Korean words, which were historically borrowed from the Chinese language, can be written in either Hanja (Chinese characters) or Hangeul (Korean characters). Previous Korean Input Method Editors (IMEs) provide only a simple dictionary-based approach to Hangeul-Hanja conversion. This paper presents a sentence-based statistical model for Hangeul-Hanja conversion, with word tokenization included as a hidden process. As a result, we reach 91.4% character accuracy and 81.4% word accuracy in the terminology domain, when only very limited Hanja data is available.
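
The abstract does not spell out the model, so the following is only a minimal sketch of the general idea of treating tokenization as a hidden step: a dynamic program picks the segmentation and the Hanja choice for each word together, maximizing the product of word probabilities over the whole sentence. The unigram scoring, the `LEXICON` entries, their probabilities, and the `convert` function are all illustrative assumptions, not the authors' method or data.

```python
import math

# Hypothetical toy lexicon: Hangeul string -> list of (output form, probability).
# Entries and probabilities are invented for illustration only.
LEXICON = {
    "한자": [("漢字", 0.9), ("한자", 0.1)],
    "변환": [("變換", 0.8), ("변환", 0.2)],
    "한": [("韓", 0.3), ("한", 0.7)],
    "자": [("字", 0.2), ("자", 0.8)],
}


def convert(sentence, lexicon=LEXICON, max_len=4, unk_prob=1e-6):
    """Jointly segment `sentence` and choose an output form for each word.

    best[i] holds (log probability, start of last word, output of last word)
    for the best analysis of the first i characters.
    """
    n = len(sentence)
    best = [(-math.inf, None, None)] * (n + 1)
    best[0] = (0.0, None, None)

    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            prev_score = best[start][0]
            if prev_score == -math.inf:
                continue
            chunk = sentence[start:end]
            candidates = lexicon.get(chunk)
            if candidates is None:
                # Assumed fallback: an unknown single character is copied
                # unchanged with a small probability; longer unknown chunks
                # are not considered as words.
                if len(chunk) != 1:
                    continue
                candidates = [(chunk, unk_prob)]
            for output, prob in candidates:
                score = prev_score + math.log(prob)
                if score > best[end][0]:
                    best[end] = (score, start, output)

    # Follow the back pointers from the end of the sentence.
    words = []
    pos = n
    while pos > 0:
        _, start, output = best[pos]
        words.append(output)
        pos = start
    return words[::-1]


if __name__ == "__main__":
    print(convert("한자변환"))
```

With this toy lexicon, the two-word segmentation scores 0.9 × 0.8 = 0.72, which beats the three-word alternative (0.7 × 0.8 × 0.8 = 0.448), so the segmentation is never fixed in advance but falls out of the search. A model closer to the paper's description would presumably score words in sentence context rather than independently.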

Similar resources

Missionary contributions toward the revaluation of Hangeul in late nineteenth-century Korea

Soon after their arrival in Korea, Christian missionaries were confronted by decisions regarding how they would present written materials to the Korean people. While many Koreans used their indigenous script (Hangeul) for everyday purposes, higher status literacy materials were expected to be presented using Chinese characters (Hanja), a system unfamiliar to most but considered more prestigious...

Using Context-based Statistical Models to Promote the Quality of Voice Conversion Systems

This article examines methods for optimizing the performance of GMM-based voice conversion systems, with the GMM method serving as the baseline for improvement. In current methods, because a single conversion function is used to convert all speech units and statistical averaging causes spectral smoothing, we will observe quality ...

Sampling Rate Conversion in the Discrete Linear Canonical Transform Domain

Sampling rate conversion (SRC) is one of the important issues in modern sampling theory. It can be realized by up-sampling, filtering, and down-sampling operations, which are computationally expensive. Although some efficient algorithms have been presented to do the sampling rate conversion, they all need to compute the N-point original signal to obtain the up-sampling or the down-sampling signal in the tim...

Unicode Canonical Decomposition for Hangeul Syllables in Regular Expression

Owing to the high expressiveness of regular expressions, they are frequently used for searching and manipulating text-based data. Regular expressions are highly applicable to processing Latin-alphabet-based text, but the same cannot be said for Hangeul, the writing system for the Korean language. Although Hangeul possesses alphabetic features within the script, expressiveness of regular expression pat...

Dissociative Disturbance in Hangul-Hanja Reading after a Left Posterior Occipital Lesion

Since the Korean language has two distinct writing systems, phonogram (Hangul) and ideogram (Hanja: Chinese characters), alexia can present with dissociative disturbances in reading between the two systems. A 74-year-old right-handed man presented with a prominent reading impairment in Hangul with agraphia of both Hangul and Hanja after a left posterior occipital-parietal lesion. He could not ...

Publication date: 2004